for loop and while loopA Python script is a file containing Python code that can be executed all at once.
.pygenotype = "AG"
phenotype = "expressed"
if genotype == "AG":
# Only check phenotype if genotype is "AG"
if phenotype == "expressed":
print("Variant is active and expressed.")
else:
print("Variant is inactive.")
else:
print("Non-target variant.")
TP53 COX2 EGFR MTOR
Put some of your Python code you have written yesterday and save it in a Python script, run it in a terminal with either
python yourscript.py
or
./yourscript.py
yourscript.py with the actual name you use#!/usr/bin/env python at the beginning of the script and also make the script executable in order to run it with the second method file = open("filename.txt", "r")
content = file.read()
file.close()
'r' specifies the mode for opening the file as a read-only text file. If the file does not exist, a FileNotFoundError will be raised.
Other types of mode:
'rb': Opens the file as a binary file for reading.'r+ : Opens the file for reading and writing.Extra reading: https://www.w3schools.com/python/python_file_handling.asp
with keyword to ensure the file handle be closed automatically¶file = open("filename.txt", "r")
content = file.read()
file.close()
with open("filename.txt", "r") as file:
content = file.read()
encoding='utf-8' when opening a file in Python to ensure proper handling of text files that contain non-ASCII characters, such as Chinese characters.¶with open ("filename.txt", "r", encoding='utf-8') as file
content = file.read()
I have a file contains DNA sequences and want to use Python to read its content and then calculate the number of nucleotides, that is, get the length of the dna sequence.
seqfile = "../files/one_dna_sequence.fa"
with open(seqfile, "r") as file:
for line in file:
print(line)
>SEQUENCE_1 AGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGC TTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTA GCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCT AGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTT AGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGC TAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAA GCTTAGCTAGCTAGCTAGCTTAGCTA
string¶'string'.strip() Removes whitespace
'string'.split() Splits on whitespace into list
string1 = ' an example string with whitespaces at both ends '
string2 = string1.strip()
string2
'an example string with whitespaces at both ends'
list1 = string2.split()
list1
['an', 'example', 'string', 'with', 'whitespaces', 'at', 'both', 'ends']
list2 = string1.strip().split()
list2
['an', 'example', 'string', 'with', 'whitespaces', 'at', 'both', 'ends']
file = open("output.txt", "w")
file.write("Hello, Python!")
file.close()
'w': Opens a file for writing only. If the file does not exist, it creates a new file. If the file exists, its previous content will be truncated, effectively deleting the content before the new write operations.Other modes:
'a': Opens the file for writing, appending any new data you write to the end of the file's existing content, thus preserving the previous content.Extra reading: https://www.w3schools.com/python/python_file_handling.asp
with open("output.txt", "w") as file:
file.write("Hello, Python!")
with open("output.txt", "w", encoding = 'utf-8') as file:
file.write("Hello, Python!")
seqfile = "../files/one_dna_sequence.fa"
with open(seqfile, "r") as file:
seqlength = 0
for line in file:
line = line.strip()
if not line.startswith(">"):
seqlength += len(line)
outfile = "../files/output/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
file.write("Length of DNA sequence: " + str(seqlength))
Link: https://python-bioinfo.bioshu.se/exercises.html
You have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000). What you want to know is whether this sample has any variants in this region, and if so, which variants.
0/1
0 means DNA at this position is the same as reference allele (A), 1 means has alternative allele, in this case AC The notation of alternative allele is not always 1
0/22 means it is the second alternative allele, i.e. C in this case#)5 and position is between 1,200,000 and 1,300,000Open the file and loop over lines (ignore lines starting with #), and print the first record
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
counter = 0
for line in fh:
if not line.startswith('#'):
print(line.strip())
break
# Next, find chromosome 5
1 762969 rs115616822 G A . PASS AA=g;AC=1;AN=120;DP=226;GP=1:773106;BN=132 GT:DP:CB 0/0:1:SMB 0/0:6:SMB 0/0:3:SMB 0/0:5:SMB 0/0:15:SMB 0/0:6:SMB 0/0:4:SMB 0/0:1:SMB 0/0:11:SMB 0/0:0:SMB 0/0:3:SMB 0/0:3:SMB 0/0:5:SMB 0/0:2:SMB 0/0:8:SMB 0/0:2:SMB 0/0:4:SMB 0/0:2:SMB 0/0:4:SMB 0/0:1:SMB 0/0:6:SMB 0/0:3:SMB 0/0:6:SMB 0/0:1:SMB 0/0:2:SMB 0/0:6:SMB 0/0:7:SMB 0/0:0:SMB 0/0:0:SMB 0/0:3:SMB 0/0:4:SMB 0/0:3:SMB 0/0:2:SB 0/0:6:SMB 0/0:6:SMB 0/0:1:SMB 0/0:1:SMB 0/0:0:SMB 0/0:8:SMB 0/0:7:SMB 0/0:0:SMB 0/0:1:SMB 0/0:3:SMB 0/0:0:SMB 0/0:8:SMB 0/0:5:SMB 0/0:4:SMB 0/0:0:SMB 0/0:8:SMB 0/0:0:SMB 0/0:4:SMB 0/0:5:SMB 0/0:7:SMB 0/0:4:SMB 0/0:2:SMB 0/0:0:SMB 0/0:2:SMB 0/0:2:SMB 0/0:5:SMB 0/1:8:SMB
Identify lines where chromosome is 5 and position is between 1,200,000 and 1,300,000
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
for line in fh:
if not line.startswith('#'):
cols = line.strip().split('\t')
if cols[0] == '5':
print(line)
break
# Next, find the correct region
5 106565 rs115608877 G T . PASS AA=.;AC=7;AN=120;DP=91;GP=5:53565;BN=132 GT:DP:CB 1/0:1:SM 1/0:3:SM 0/0:1:SM 0/0:2:SM 0/0:0:S 0/0:0:SM 0/0:0:SM 0/0:1:S 0/0:0:S 0/0:0:S 0/0:2:SM 0/1:1:SM 0/0:3:SM 0/1:0:SM 0/0:2:SM 1/0:1:S 0/0:1:SM 0/0:0:SM 0/0:2:SM 0/0:1:SM 0/0:5:SM 0/0:0:S 0/0:1:SM 0/0:0:S 0/0:1:SM 0/0:1:SM 0/1:1:SM 0/1:1:SM 0/0:2:SM 0/0:3:SM 0/0:2:SM 0/0:8:SM 0/0:0:SM 0/0:7:SM 0/0:4:SM 0/0:0:S 0/0:2:S 0/0:0:SM 0/0:3:SM 0/0:2:SM 0/0:1:SM 0/0:2:SM 0/0:3:S 0/0:0:SM 0/0:3:SM 0/0:1:SM 0/0:0:SM 0/0:0:SM 0/0:3:S 0/0:1:SM 0/0:0:SM 0/0:0:SM 0/0:3:SM 0/0:4:SM 0/0:2:SM 0/0:1:SM 0/0:0:SM 0/0:0:SM 0/0:1:SM 0/0:2:SM
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
for line in fh:
if not line.startswith('#'):
cols = line.strip().split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and 1200000 <= pos <= 1300000:
print(cols)
break
# Next, find the genotypes for sample1
['5', '1207129', 'rs111580366', 'G', 'A', '.', 'PASS', 'AA=G;AC=1;AN=120;DP=329;GP=5:1154129;BN=132', 'GT:DP:CB', '0/0:4:SMB', '0/1:17:SMB', '0/0:9:SMB', '0/0:9:SMB', '0/0:6:SMB', '0/0:8:SMB', '0/0:4:SMB', '0/0:4:SMB', '0/0:7:SMB', '0/0:1:SMB', '0/0:14:SMB', '0/0:3:SMB', '0/0:5:SMB', '0/0:7:SMB', '0/0:9:SMB', '0/0:10:SMB', '0/0:6:SMB', '0/0:2:SMB', '0/0:4:SMB', '0/0:5:SMB', '0/0:13:SMB', '0/0:2:SMB', '0/0:9:SMB', '0/0:3:SMB', '0/0:4:SMB', '0/0:4:SMB', '0/0:6:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:6:SMB', '0/0:3:SMB', '0/0:3:SMB', '0/0:8:SMB', '0/0:5:SMB', '0/0:10:SMB', '0/0:2:SMB', '0/0:7:SMB', '0/0:0:SMB', '0/0:6:SMB', '0/0:3:SMB', '0/0:5:SMB', '0/0:5:SMB', '0/0:12:SMB', '0/0:8:SMB', '0/0:15:SMB', '0/0:5:SMB', '0/0:2:SMB', '0/0:1:SMB', '0/0:6:SMB', '0/0:2:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:3:SMB', '0/0:6:SMB', '0/0:4:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:3:SMB', '0/0:3:SMB', '0/0:3:SMB']
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
for line in fh:
if not line.startswith('#'):
cols = line.strip().split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and 1200000 <= pos <= 1300000:
geno = cols[9]
print(geno)
break
# Next, extract the genotypes only
0/0:4:SMB
Extract the genotypes only from the column
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
for line in fh:
if not line.startswith('#'):
cols = line.strip().split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and 1200000 <= pos <= 1300000:
geno = cols[9].split(':')[0]
print(geno)
break
# Next, find in which positions sample1 has alternate alleles
0/0
Check if the genotype contains any alternate alleles
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
for line in fh:
if not line.startswith('#'):
cols = line.strip().split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and 1200000 <= pos <= 1300000:
geno = cols[9].split(':')[0]
if geno not in ["0/0"]:
print(geno)
#Next, print nicely
1/1 1/1
Print any variants containing alternate alleles for this sample between specified region
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
for line in fh:
if not line.startswith('#'):
cols = line.strip().split('\t')
chrom = cols[0]
pos = int(cols[1])
if chrom == '5' and 1200000 <= pos <= 1300000:
geno = cols[9].split(':')[0]
if geno not in ["0/0"]:
ref = cols[3]
alt = cols[4]
info = f"{chrom}:{pos}_{ref}-{alt} has genotype: {geno}"
print(info)
5:1235651_C-T has genotype: 1/1 5:1277795_G-C has genotype: 1/1
source: mayoclinic.org
source: cff.org
plaintext
>MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
. if not applicable.Some attributes (always semi-colon separated key-value pairs):
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";